ABSTRACT
The National Council of Teachers of Mathematics (1991) has identified
the use of computers as a necessary teaching tool for enhancing
mathematical discourse in schools. One possible vehicle of
technological change in mathematics classrooms is the Intelligent
Tutoring System (ITS), an artificially intelligent
computer-based tutor. This paper reports on 3 construct validity
studies that have been conducted with 97 high school students in
order to demonstrate the correspondence, or lack thereof, between the
theoretical constructs of the Diagram Configuration (DC) Model of
geometric proof-writing expertise (Koedinger & Anderson, 1990) and
the hints and errors being recorded by an instantiation of the DC
Model called ANGLE, an intelligent geometric proof tutor. Results of
the studies supported the appropriateness of construct validity
techniques for analyzing ITS data. The results partially confirmed a
hypothesized factor structure for the data. The paper concludes with
a discussion of the results, including suggestions for modifications
of the ANGLE program. (Contains nine references and six tables.)
(Author/SLD)
ABSTRACT

The National Council of Teachers of Mathematics (1991) has identified the use of 
computers as a necessary teaching tool for enhancing mathematical discourse in 
schools. One possible vehicle of technological change in mathematics classrooms 
is the Intelligent Tutoring System (ITS), an artificially intelligent computer-based 
tutor. This paper reports on construct validity studies that have been conducted 
in order to demonstrate the correspondence, or lack thereof, between the 
theoretical constructs of the Diagram Configuration Model of geometric proof- 
writing expertise (Koedinger & Anderson, 1990) and the hints and errors being 
recorded by an instantiation of the DC Model called ANGLE, an intelligent 
geometric proof tutor. Results of the studies supported the appropriateness of 
construct validity techniques for analyzing ITS data. The results partially 
confirmed a hypothesized factor structure for the data. The paper concludes 
with a discussion of the results, including suggestions for modifications of the 
ANGLE program. 



USING CONSTRUCT VALIDITY TECHNIQUES TO EVALUATE AN 
AUTOMATED COGNITIVE MODEL OF GEOMETRIC PROOF WRITING 

INTRODUCTION 

The National Council of Teachers of Mathematics (1991) has identified the 
use of computers as a necessary teaching tool for enhancing mathematical 
discourse in schools. Through the use of technology, NCTM (1989) envisions the 
transformation of classrooms into laboratories for experimentation and 
exploration, with the consequent altering of the teacher's role to that of a partner 
in and facilitator of student discovery. One possible vehicle of technological 
change in mathematics classrooms is the Intelligent Tutoring System (ITS), an 
artificially intelligent computer-based tutor. 

A New Geometry Learning Environment, or ANGLE, is an ITS that was 
specifically developed as a testbed for a new schema-based cognitive theory of 
geometric proof-writing called the Diagram Configuration (DC) Model 
(Koedinger & Anderson, 1990). The system is capable of collecting a large 
volume of on-line data as students are engaged in problem solving, and this 
information is used by ANGLE to maintain a model of student cognitions. 
Collected data includes the number and type of errors committed by a student, 
the number and type of hints requested by the student, and the time needed to 
solve a particular problem. Table 1 and Table 2 present a list of error and hint 
message-types, the abbreviations that will be used in this paper, their meanings, 
and an example of each. 



Insert Tables 1 and 2 About Here 



Geometric knowledge in the DC Model is not assumed to be hierarchical, 
but rather it is theorized to be organized according to diagrammatic schemas. 
An example of a schema, shown in Figure 1, might be the situation
CONGRUENT-TRIANGLES-SHARED-SIDE, in which two congruent triangles
share a common side.

Insert Figure 1 About Here

The DC Model posits that there exist three major processes or constructs in
geometric proof-writing:

1. Diagram Parsing - Identifying familiar configurations in the problem 
diagram and instantiating the corresponding schemas (a body of 
geometric facts associated with the diagram); 

2. Statement Encoding - Comprehending given and goal statements by
canonically representing them as part-statements (relationships among
parts of the diagram, e.g., AB=DB in Figure 1);

3. Schema Search - Iteratively applying schemas in forward or backward
inferences until a link between the given and goal statements is found.

A primary goal of the studies of this paper was to demonstrate the
correspondence, or lack thereof, between the theoretical constructs of the DC
Model and the hints and errors being recorded by ANGLE.

PURPOSE 

As ITS programs become a reality in mathematics classrooms, and as ITS 
developers become more willing to conduct extensive classroom evaluations of 
their programs, it will be increasingly important to ensure that there is a viable 
method of evaluating the validity of the data being gathered by these systems. 
Further, as data from these models are made available to teachers for use in 
student assessment, it will be important to avoid the use of redundant or
ineffective measures of student performance which might degrade the benefit of
computer-assisted evaluations.

The focus of this paper is on the usefulness of construct validation 
methods for evaluating ITS cognitive models of mathematical performance. The 
goal of a construct validity study is to maximize the descriptive power of test 
scores for an individual by maximizing evidential support for the adequacy and 
appropriateness of the inferences made from those scores (Messick, 1989). 
Construct validity studies are fundamentally theory-based. That is, for the 
conclusions of such studies to have any meaning or significance, the traits and 
measures to be investigated must be supported in theory, and the results of the 
studies must be interpreted in light of that theory. This paper argues that 
because ITS automated cognitive models are often instantiations of cognitive 
learning theories, construct validity studies are appropriate means for enhancing 
the meaningfulness of ITS on-line data used to generate such models. The 
questions to be answered by construct validity studies of ITS data include: 

- To what degree is the ITS measuring its target constructs? 

- Is the tutor measuring any unintended constructs? 

- Can some student measures be modified or eliminated from ITS 
programming due to their ineffectiveness or redundancy in measuring 
constructs of interest? 

METHOD 

A laboratory test was conducted at Carnegie-Mellon University during the 
Summer of 1990, in which student performance with the ANGLE ITS was 
compared with student use of a first generation Geometry Proof Tutor (GPT) 
(Koedinger & Anderson, 1993a). Thirty students participated in eight hours of 
instruction over two weeks, half using ANGLE and half using the GPT. Only the 
ANGLE portion of the data was used in the first study of this paper. 






The purpose of the first study was to determine whether data from
ANGLE could be considered sufficiently similar to test scores for an investigation
of their construct validity. It was hypothesized that, for a subset of problems of
similar difficulty, the number of error and hint messages received by students
would not change appreciably over time as a result of feedback provided by
ANGLE. This study involved 15 students solving six medium-difficulty proofs
in a laboratory setting (see Table 3 for a listing of problem-types). Two separate
repeated measures ANOVA analyses were conducted, one using counts from
six error messages and the other employing counts from four hint messages.
Each analysis utilized the six problems as six measurement occasions.



Insert Table 3 About Here

A classroom evaluation of ANGLE, involving a total of sixty students, was 
conducted at a Pittsburgh public high school during April and May, 1992 
(Koedinger & Anderson, 1993b). The second study reported here involved 42 
students using ANGLE to solve one medium-difficulty proof (Figure 2). The aim 
of the study was to explore the factor structure of counts from three hint 
messages and three error messages that were hypothesized specifically to 
measure the constructs of the DC Model. Exploratory factor analysis, together
with judgmental and logical analyses, was used for this purpose.



Insert Figure 2 About Here 



The third study analyzed data from the most recent classroom evaluation
of ANGLE. Data were collected at a Pittsburgh public high school during April
and May, 1993 using a somewhat larger sample than was available for the first
classroom assessment. The third study of this paper employed the on-line
protocols recorded for 40 students who solved the same problem analyzed in the
second study. Confirmatory factor analysis was employed to determine more
fully the validity of the interpretation of the factors identified in the second
study.

RESULTS 

First Study 

The first study involved a repeated measures ANOVA, using data from 
each of six problems of similar difficulty-level as six measurement occasions. 
The analysis employed the ANGLE total message counts for each individual as 
the dependent variable and occasion (message and problem) as the independent 
variable. Two separate analyses were conducted, one for error messages and one 
for hint messages, using Release 4.1 of the statistics package SPSS-X (SPSS, 1988). 
The variable PSENTER was dropped from consideration because of its zero 
variance for some of the problems. Additionally, one problem for each of three 
students had to be dropped from consideration because the time to solve the 
problem was recorded as zero, indicating an ANGLE malfunction for that
problem.

For the main effect of occasion, there were no significant multivariate or 
univariate results for the dependent measures of either analysis (p>.05). The two 
analyses produced five combinations of occasions for each error or hint variable. 
For a conservative test, the five combinations for each variable were considered 
as one scale with a p-level of .05/5=.01 for each combination. Out of 30 possible 
combinations of occasions for six error messages, only WPJUSTER had a
significant F value (p=.009), for the combination involving problems 1, 2, 3 and 6.
None of the 20 possible combinations of occasions for four hint messages was
significant at the .01 level.
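
The conservative test described above amounts to a Bonferroni-style division of the .05 level across the five combinations examined for each variable. The following minimal sketch reproduces that arithmetic; the function and variable names are illustrative assumptions and do not come from the SPSS-X output.

```python
def per_comparison_alpha(family_alpha, n_comparisons):
    """Split a family-wise alpha evenly across comparisons (Bonferroni-style)."""
    if n_comparisons <= 0:
        raise ValueError("n_comparisons must be positive")
    return family_alpha / n_comparisons


FAMILY_ALPHA = 0.05              # overall level for one message variable
COMBINATIONS_PER_VARIABLE = 5    # combinations of occasions per variable
N_ERROR_VARIABLES = 6            # error messages analyzed in the first study
N_HINT_VARIABLES = 4             # hint messages analyzed in the first study

alpha_each = per_comparison_alpha(FAMILY_ALPHA, COMBINATIONS_PER_VARIABLE)
n_error_tests = N_ERROR_VARIABLES * COMBINATIONS_PER_VARIABLE
n_hint_tests = N_HINT_VARIABLES * COMBINATIONS_PER_VARIABLE

# The one reported p-value (WPJUSTER, problems 1, 2, 3 and 6) against the
# adjusted level.
wpjuster_significant = 0.009 < alpha_each

print(f"alpha per combination: {alpha_each:.2f}")
print(f"error tests: {n_error_tests}, hint tests: {n_hint_tests}")
print(f"WPJUSTER significant at adjusted level: {wpjuster_significant}")
```

With one significant result among 50 tests at the adjusted level, the pattern is consistent with the conclusion of general stability drawn below.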

Overall, the repeated measures analyses indicated the general consistency 
of both error and hint messages across a subset of six similar difficulty level 
problems. These results provided evidence to support the assumption that, for a 
subset of six problems of similar difficulty, error and hint messages did not 
change appreciably for individuals as a result of tutor feedback. That is, ANGLE 
on-line measures could be considered sufficiently similar to test scores for an 
investigation of their construct validity.

Second Study

For the classroom evaluation, the execution phase of proof construction 
was turned off by its developers, eliminating the recording of the on-line 
variables EXJUSTER and EXJUSTHT. The second study began with a principal 
components analysis and an exploratory factor analysis. Both analyses were run 
using Release 4.0 of the statistics package SPSS-X (SPSS, 1988). The Kaiser- 
Meyer-Olkin measure of sampling adequacy for the dataset was .66571, which is 
satisfactory (Tabachnick & Fidell, 1989). 

Since PSENTER occurred only once, for a single student, and
MSJUSTER and MSIOER could not be considered to measure one of the
hypothesized constructs, only six variables were included in the principal
components and factor analyses. One error and one hint message were
hypothesized by this author to measure each of the three theoretical constructs of
the DC Model (Figure 3). The constructs are depicted as ovals and the ANGLE
measures as rectangles. The curved arrows represent simple correlations
between constructs and the longer straight arrows indicate the hypothesized
causal relationships between constructs and measures. The smaller arrows at the
bottom of the figure portray residual variances from unspecified influences (e.g.,
measurement error).

The principal components analysis calculated a correlation matrix for the
six messages, as well as eigenvalues and explained variance statistics, which are
shown in Table 4. Note that for a principal components analysis, values of 1 are
placed in the diagonal of the correlation matrix; these values are replaced with
squared multiple correlations for factor analysis.



Insert Table 4 About Here



The principal components analysis extracted two factors, each with an
eigenvalue greater than 1.00, which together explained 63 percent of the
variance for the six categories of messages. Therefore, a two-factor model was
explored using both a VARIMAX (orthogonal) and an OBLIMIN (oblique) rotation.
For the OBLIMIN solution, the two factors were found to have a correlation of
.47823, or a substantial 23% overlap in variance (Table 5). Additionally, all
loadings for the VARIMAX rotation were positive, a further indication that an
OBLIMIN solution was appropriate for the data.
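
The figures quoted in this passage can be checked by hand. The sketch below, which uses only values printed in Tables 4 and 5, recomputes the percent of variance from each eigenvalue (eigenvalue divided by the six standardized variables), the shared variance implied by the factor correlation (its square), and the relationship that ordinarily holds between the two matrices of an oblique solution: structure loadings equal pattern loadings multiplied by the factor correlation matrix. Reading the second matrix of Table 5 as the structure matrix is an assumption here, and the function names are illustrative only.

```python
N_VARIABLES = 6                    # six message counts entered the analysis
EIGENVALUES = [2.72597, 1.05621]   # Table 4
PHI = 0.47823                      # OBLIMIN correlation between the factors

# Pattern loadings (Factor 1, Factor 2) copied from Table 5.
PATTERN = {
    "SCJUSTHT": (0.99010, -0.09946),
    "WPJUSTER": (0.92111, -0.09057),
    "CFENTER": (0.41881, 0.15942),
    "PSJUSTER": (0.38966, 0.08262),
    "SCSELHT": (-0.06103, 0.68919),
    "PSJUSTHT": (0.17636, 0.43905),
}


def percent_variance(eigenvalue, n_variables):
    """Percent of total standardized variance carried by one component."""
    return 100.0 * eigenvalue / n_variables


def structure_from_pattern(pattern_row, phi):
    """Multiply a two-factor pattern row by the matrix [[1, phi], [phi, 1]]."""
    p1, p2 = pattern_row
    return (p1 + phi * p2, phi * p1 + p2)


percents = [percent_variance(ev, N_VARIABLES) for ev in EIGENVALUES]
shared_variance = PHI ** 2
structure = {name: structure_from_pattern(row, PHI)
             for name, row in PATTERN.items()}

print("percent of variance:", [round(p, 1) for p in percents])
print("cumulative percent:", round(sum(percents), 1))
print("shared variance between factors:", round(shared_variance, 3))
for name, (s1, s2) in structure.items():
    print(f"{name:9s} {s1: .5f} {s2: .5f}")
```

The recomputed structure loadings agree with the second matrix of Table 5 to about four decimal places, which supports reading that matrix as the structure matrix of the OBLIMIN solution.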



Insert Table 5 About Here 



All three error messages and one hint message (SCJUSTHT) loaded on
Factor 1, while the two remaining hint messages loaded on Factor 2. The
Cronbach-alpha value for Factor 1 was .78, which was considered reasonable; the
same consistency measure was only .38 for Factor 2, denoting a lack of stability
for the measures of the factor. The error and hint messages hypothesized to
measure the construct of Schema Search both loaded on Factor 1, indicating a
possible identity for the factor.
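
For readers who wish to gauge internal consistency from the published correlations alone, the following sketch computes the standardized form of Cronbach's alpha, k*r/(1 + (k-1)*r), where r is the mean inter-item correlation taken from Table 4. This is an assumption-laden stand-in: the coefficients reported above were presumably computed from the raw message counts, which are not reproduced in this paper, so the standardized values need not match them. All names in the code are illustrative.

```python
from itertools import combinations

# Inter-item correlations copied from Table 4 (only the pairs needed here).
CORRELATIONS = {
    frozenset(("CFENTER", "PSJUSTER")): 0.31727,
    frozenset(("CFENTER", "WPJUSTER")): 0.38579,
    frozenset(("CFENTER", "SCJUSTHT")): 0.44394,
    frozenset(("PSJUSTER", "WPJUSTER")): 0.35378,
    frozenset(("PSJUSTER", "SCJUSTHT")): 0.39829,
    frozenset(("WPJUSTER", "SCJUSTHT")): 0.84451,
    frozenset(("SCSELHT", "PSJUSTHT")): 0.34715,
}

FACTOR_1_ITEMS = ["CFENTER", "PSJUSTER", "WPJUSTER", "SCJUSTHT"]
FACTOR_2_ITEMS = ["SCSELHT", "PSJUSTHT"]


def standardized_alpha(items, correlations):
    """Standardized Cronbach's alpha from the mean inter-item correlation."""
    pairs = list(combinations(items, 2))
    mean_r = sum(correlations[frozenset(pair)] for pair in pairs) / len(pairs)
    k = len(items)
    return k * mean_r / (1.0 + (k - 1) * mean_r)


alpha_factor_1 = standardized_alpha(FACTOR_1_ITEMS, CORRELATIONS)
alpha_factor_2 = standardized_alpha(FACTOR_2_ITEMS, CORRELATIONS)

print(f"standardized alpha, Factor 1: {alpha_factor_1:.3f}")
print(f"standardized alpha, Factor 2: {alpha_factor_2:.3f}")
```

The standardized values come to roughly .77 for Factor 1 and .52 for Factor 2. The first is close to the reported .78; the second is higher than the reported .38 but still low, so the qualitative conclusion about Factor 2 is unchanged.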

In an effort to further identify the factors produced in each of the above
analyses, judgmental and logical evaluations were made of the ANGLE data,
illuminated by the literature on ANGLE. With regard to CFENTER and
PSJUSTER loading on Factor 1, Koedinger and Anderson (1993b) have made the
point that "... a few students occasionally took a rather mindless trial-and-error
approach to working with ANGLE" (p. 247). This approach involved students
using feedback received from ANGLE in order to guide them in their proof-
writing. The developers' observation was supported by the loading of the error
messages WPJUSTER, CFENTER and PSJUSTER on a single factor.

Nevertheless, the identity of Factor 1 as primarily Schema Search is still 
reasonable given the strong loadings of WPJUSTER and SCJUSTHT. Whereas it 
might have been relatively easy (though time-consuming) for novices to 
instantiate schemas and choose part-statements by intentionally making errors, it 
proved to be much more difficult to establish ways-to-prove in the same manner; 
hence the loading of SCJUSTHT with WPJUSTER on Factor 1. It appears that 
the Schema Search aspect of proof writing presented more challenges for the 
novices of the classroom study than other proficiencies. 

Given the fact that the Cronbach-alpha coefficient was low for the 
measures SCSELHT and PSJUSTHT, any rival hypothesis concerning Factor 2 
would be considered tenuous at best. There is certainly a relationship between 
the two hint options, and that relationship might be interpreted as method- 
specific variance produced by a dependence of some students on SCSELHT and 
PSJUSTHT to help them get through the schema selection and part-statement 
justification portions of the proof without making errors. Overall, though, the 
validity of the interpretation that the two hint variables measured either of the
constructs Diagram Parsing or Statement Encoding was not supported in the 
results.

Third Study

A confirmatory factor analysis was run using Version 3.0a of the EQS 
Structural Equation Program (Bentler, 1989). The analysis used the factor 
structure specified in Table 5 in an attempt to confirm the measurement model's 
goodness-of-fit for a different sample of students. Table 6 contains the results of 
the confirmatory analysis, including goodness-of-fit indices, factor loadings and 
standardized residuals. 

The Bentler-Bonett Normed Fit Index value of .974 exceeded the minimum 
recommended value of .90 (Tabachnick & Fidell, 1989), an indication of the 
appropriateness of the measurement model. The Chi-Square test and the Bentler-
Bonett Nonnormed Fit Index agreed with this result. The factor loadings and
standardized residuals reveal that PSJUSTER is the primary source of any
lack of fit of the model to the data. Note that the relative loadings of the variables in
the confirmatory analysis agree with those of the exploratory analysis, though 
the actual values were lower for the Factor 1 measures. It should be pointed out 
that the variance for each of the variables of the dataset was low; therefore, minor 
differences in student response patterns from those of the dataset could have a 
significant impact on the fit indices reported for this analysis. 
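
Two of the quantities in Table 6 can be verified from the printed values. The sketch below recomputes the p-value for the reported Chi-Square (3.066 with 8 degrees of freedom) using the closed-form upper-tail probability that holds for even degrees of freedom, and checks the residuals under the assumption that the solution is completely standardized, so that each residual equals the square root of one minus the squared loading. That assumption, and every name in the code, is illustrative and is not taken from the EQS output.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variate; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k          # builds (x/2)^k / k!
        total += term
    return math.exp(-half) * total


def residual_from_loading(loading):
    """Standardized residual implied by a standardized loading."""
    return math.sqrt(max(0.0, 1.0 - loading ** 2))


# (loading, reported residual) pairs copied from Table 6.
TABLE_6 = {
    "SCJUSTHT": (0.580, 0.815),
    "WPJUSTER": (0.468, 0.884),
    "CFENTER": (0.363, 0.932),
    "PSJUSTER": (0.205, 0.979),
    "SCSELHT": (1.000, 0.000),
    "PSJUSTHT": (0.970, 0.244),
}

p_value = chi2_upper_tail_even_df(3.066, 8)
print(f"p-value for Chi-Square = 3.066, 8 d.f.: {p_value:.4f}")

for name, (loading, reported) in TABLE_6.items():
    implied = residual_from_loading(loading)
    print(f"{name:9s} implied {implied:.3f}  reported {reported:.3f}")
```

The recomputed p-value is approximately .930, in line with the reported .93012, and the implied residuals agree with Table 6 to within rounding. Because the Chi-Square is smaller than its degrees of freedom, the Nonnormed Fit Index, whose numerator subtracts the model's Chi-Square/d.f. ratio where its denominator subtracts 1, necessarily exceeds 1, which is consistent with the reported 1.088.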

DISCUSSION 

A primary goal of any construct validity study is to provide information 
concerning the strength of measures of traits or abilities, so that those measures 
may be reformulated or refined to measure more accurately the constructs of 
interest. The results of the second and third studies suggested that the 
unanticipated loading of all error messages provided by ANGLE on a single 
factor was reflective of the fact that students were using those messages to guide
their proof-writing. It is certainly possible that some of the students consciously 
used the error messages as a means of learning how to write correct proofs, but 
given the developers' observations of students at work with ANGLE (Koedinger 
& Anderson, 1993b), it seems unlikely that use of error feedback was anything 
more than a "work-around" in order to get through the proof. 

Given the above, it is useful to speculate on changes that might be made in
the design of ANGLE in order to force students to take a less random approach
to proof-writing. Since "guess-and-check" is a legitimate problem-solving
strategy presently being taught in school mathematics, it would probably not be
a good idea to simply penalize guessing by, say, shutting off the feedback option
at some point in a proof. Instead, it might be advisable to establish a
beginning point total or score for a given proof. Each error committed or,
possibly, each hint requested (though this is really a separate issue) would
deduct from the total score for the problem. Thus, students would have two
goals for any proof: write the proof correctly, and maintain the point total in
some acceptable range. Writing proofs then becomes a game to be won and thus
hopefully more motivating and involving for students.
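
The scoring idea can be pictured with a minimal sketch. Nothing here is part of ANGLE: the starting total, the penalties, the acceptable minimum, and all class and method names are hypothetical values chosen only to illustrate the proposal.

```python
from dataclasses import dataclass


@dataclass
class ProofScore:
    """Running point total for one proof (all defaults are hypothetical)."""
    points: int = 100            # beginning point total for the proof
    error_penalty: int = 5       # deducted for each error message received
    hint_penalty: int = 2        # deducted for each hint requested
    acceptable_minimum: int = 60

    def record_error(self):
        self.points = max(0, self.points - self.error_penalty)

    def record_hint(self):
        self.points = max(0, self.points - self.hint_penalty)

    def in_acceptable_range(self):
        return self.points >= self.acceptable_minimum


# Example session: two errors and one hint on a single proof.
score = ProofScore()
score.record_error()
score.record_error()
score.record_hint()
print(score.points, score.in_acceptable_range())   # 88 True
```

Feedback is never switched off in this scheme, so guess-and-check remains available; it simply carries a visible cost. Setting the hint penalty to zero would separate the question of hints from that of errors, as the text suggests.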

Another issue arising from the second and third studies was the lack of
evidence for an interpretation of any variables actually measuring the constructs
Diagram Parsing and Statement Encoding. It may be that the loadings of the
error measures would change if the use of error feedback were more regulated, as
suggested above. It may also be that the measures are simply too global to
effectively measure the traits of interest. That is, a category of error messages
such as PSJUSTER might need to be subdivided into narrower categories of
messages that reflect more clearly the different stages of proof construction; this
would be especially important for more complicated proofs involving many
steps. Further, the viability of subdividing different categories of error and hint
messages might then suggest the possibility of the existence of more constructs.
This possibility could be investigated via additional exploratory analyses, such as
those of the second study.

As ITS programs become a reality in mathematics classrooms, and as ITS 
developers become more willing to conduct extensive classroom evaluations of 
their programs, it will be increasingly important to ensure that there is a viable 
method of evaluating the validity of the data being gathered by these systems. 
Further, as data from these models are made available to teachers for use in 
student assessment, studies such as those of this paper will also be helpful in 
avoiding the use of redundant or ineffective measures of student performance 
which might degrade the benefit of computer-assisted evaluations. 

It should be understood that the evaluation methods being suggested in 
this paper are intended to be iterative in nature, and thus it is unrealistic for an 
ITS developer or evaluator to believe that one or two evaluation cycles would be 
sufficient to determine fully the construct validity of inferences drawn from 
recorded variables. Nonetheless, if the ultimate goal is to develop an artificially- 
intelligent system which is capable of meaningfully interpreting the cognitive 
processes of its users, construct validity studies represent an accessible, theory- 
based approach to evaluating automated cognitive models. 

Table 1. ANGLE Error Messages

Configuration Entry Error (CFENTER) - Failing to correctly
identify/enter a diagram configuration

Example: "This concept [e.g., CONGRUENT-TRIANGLES-SHARED-SIDE]
does not appear in the diagram."

Part-Statement Entry Error (PSENTER) - Failing to correctly enter a
part-statement

Example: "If the diagram is drawn accurately, angles which don't look equal
cannot be proven equal."

Part-Statement Justification Error (PSJUSTER) - Failing to correctly
justify a part-statement

Example: "To justify a statement, like DB=HF, you need to use a concept in
which DB=HF is a part-statement."

Ways-to-Prove Justification Error (WPJUSTER) - Failing to correctly
justify a schema

Example: "The statements you selected are part-statements of
ΔABF=ΔCBG, but they do not match any of the ways-to-prove."

Execution Justification Error (EXJUSTER) - Failing to correctly justify or
insert a rule

Example: "The REFLEXIVE rule does not need any premises. It is justified
by the diagram."

Miscellaneous Justification Error (MSJUSTER) - Other justification
errors

Example: "You are trying to use ΔPQW=ΔPSW to prove itself. That line of
reasoning is circular."

Miscellaneous Input/Output Error (MSIOER) - Other interface errors

Example: "To finish selecting premises, click on DONE or ABORT."

Table 2. ANGLE Hint Messages

Schema Selection Hint (SCSELHT) - Hints for entering a schema

Example: "Try to find two triangles which look congruent."

Part-statement Justification Hint (PSJUSTHT) - Hints for entering and
justifying part-statements

Example: "Look for OVERLAPPING concepts. That is, look for a part-
statement which appears in both ΔABF=ΔCBG and in a concept you've
already proven."

Schema Justification Hint (SCJUSTHT) - Hints for justifying a schema

Example: "Find proven part-statements of ΔABD=ΔEFH and use them to
justify it."

Execution Justification Hint (EXJUSTHT) - Hints for adding rules

Example: "Prove RQ=RS using the CORRES-PARTS rule."

Figure 1. Diagram of ANGLE's CONGRUENT-TRIANGLES-SHARED-SIDE Schema

Table 3. ANGLE Problems Analyzed During the First Study

Problem 1 - Prove triangles congruent using Angle-Angle-Side (AAS);

Problem 2 - Prove triangles congruent using AAS and then prove angles
congruent using Corresponding Parts of Congruent Triangles are
Congruent (CPCTC);

Problem 3 - Prove triangles congruent using AAS;

Problem 4 - Prove triangles congruent using AAS;

Problem 5 - Prove triangles congruent using Side-Angle-Side (SAS);

Problem 6 - Prove triangles congruent using SAS and then prove angles
congruent using CPCTC.

Figure 2. ANGLE Problem Analyzed for the Second Study

GIVENS: <ADX=<XAD
        <ADC=<BAD

GOAL: AB=DC

Figure 3. Hypothesized Path Diagram Model for ANGLE

Table 4. Principal Components Analysis Results for ANGLE

Correlations Among Items (messages):

            CFENTER   PSJUSTER  WPJUSTER  SCSELHT   PSJUSTHT  SCJUSTHT
CFENTER     1.00000
PSJUSTER     .31727   1.00000
WPJUSTER     .38579    .35378   1.00000
SCSELHT      .17924    .21021    .15904   1.00000
PSJUSTHT     .27761    .07580    .37233    .34715   1.00000
SCJUSTHT     .44394    .39829    .84451    .22094    .30404   1.00000

Factor   Eigenvalue   Percent of Variance   Cumulative Percent
1        2.72597      45.4                  45.4
2        1.05621      17.6                  63.0

Table 5. Rotated Factor Matrices for ANGLE

Pattern Matrix:

MESSAGE-TYPE   FACTOR 1   FACTOR 2
SCJUSTHT        .99010    -.09946
WPJUSTER        .92111    -.09057
CFENTER         .41881     .15942
PSJUSTER        .38966     .08262
SCSELHT        -.06103     .68919
PSJUSTHT        .17636     .43905

Structure Matrix:

MESSAGE-TYPE   FACTOR 1   FACTOR 2
SCJUSTHT        .94254     .37404
WPJUSTER        .87780     .34993
CFENTER         .49505     .35971
PSJUSTER        .42917     .26897
SCSELHT         .26855     .66000
PSJUSTHT        .38633     .52339

Table 6. Results of the ANGLE Confirmatory Factor Analysis

Factor Loadings with Standardized Residuals:

VARIABLE    FACTOR 1   FACTOR 2   RESIDUALS
SCJUSTHT      .580        --         .815
WPJUSTER      .468        --         .884
CFENTER       .363        --         .932
PSJUSTER      .205        --         .979
SCSELHT        --        1.000       .000
PSJUSTHT       --         .970       .244

Goodness-of-fit Indices:

Chi-Square = 3.066 with 8 d.f. Non-Significant (p=.93012)
Bentler-Bonett Normed Fit Index = 0.974
Bentler-Bonett Nonnormed Fit Index = 1.088